Modernize code generation for the external LLVM 22 back-end #3169
Merged
Machine code is generated by an external, up-to-date LLVM, so target selection should not be limited by the in-process LLVM version (which only drives the middle end, and is not configured for any particular device). This makes the back-end compile natively for recent devices, e.g., sm_120a instead of sm_90 with a rewritten PTX header on Blackwell, and unlocks newer PTX ISAs.
Intrinsics unknown to the in-process LLVM, e.g. those selected by libdevice's __CUDA_ARCH dispatch, were counted as undefined functions, which needlessly triggered compilation to relocatable code and linking against cudadevrt.
Old LLVM back-ends generated nonexistent min.NaN/max.NaN instructions for fast fp64 min/max and fp16 minimum/maximum (#2886); the external back-end lowers these correctly for every subtarget.
Plain atomicrmw fadd gives LLVM real atomic semantics instead of an opaque asm blob, generating the same instructions while remaining optimizable; the back-end also expands it on devices without native support. BFloat16 atomic addition is new (sm_90 hardware, expanded elsewhere), and requires Julia 1.11 for bfloat codegen support.
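The expansion on devices without native support can be pictured as a compare-and-swap retry loop. The following is only a host-side Python model of that idea, not the code the back-end emits: `Cell32`, `f32_to_bits`, `bits_to_f32` and `atomic_fadd` are hypothetical names, and a lock stands in for hardware atomicity.

```python
import struct
import threading


class Cell32:
    """Hypothetical 32-bit memory word offering an atomic compare-and-swap."""

    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._bits

    def compare_and_swap(self, expected, desired):
        """Store `desired` if the word equals `expected`; return the old word."""
        with self._lock:
            old = self._bits
            if old == expected:
                self._bits = desired
            return old


def f32_to_bits(x):
    """Reinterpret a float as its 32-bit IEEE 754 pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b):
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def atomic_fadd(cell, value):
    """Add `value` to the float in `cell` via a CAS loop; return the old value."""
    old = cell.load()
    while True:
        new = f32_to_bits(bits_to_f32(old) + value)
        seen = cell.compare_and_swap(old, new)
        if seen == old:
            return bits_to_f32(old)  # swap succeeded
        old = seen  # another thread won the race; retry with its value
```

Like `atomicrmw`, the sketch returns the value that was in memory before the addition.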
The intrinsic has no side effects, unlike the inline assembly it replaces, so it can be CSE'd, hoisted, and constant-folded.
The inline assembly lacked the side-effect flag, allowing LLVM to merge or hoist it across divergent control flow. Use the convergent intrinsic where available (LLVM 20 and later), and mark the inline assembly as side-effecting on older versions.
The back-end aligns 128-bit integers to 16 bytes, but Julia versions before 1.12 align them to 8, so aggregates with (U)Int128 fields can lay out differently on host and device. These used to be compiled quietly, reading garbage on the device; error instead.
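To see why the alignment difference matters, here is a small Python model of C-style struct layout. It is an illustrative sketch, not CUDA.jl's actual check: the `layout` helper and the `SIZES` table are hypothetical, and it assumes every non-Int128 field is aligned to its own size.

```python
# Field sizes in bytes; non-Int128 fields are assumed to align to their size.
SIZES = {"Int8": 1, "Int32": 4, "Int64": 8, "Int128": 16}


def layout(fields, int128_align):
    """Return (field offsets, total size) for a struct with the given fields."""
    offsets = []
    offset = 0
    max_align = 1
    for name in fields:
        size = SIZES[name]
        align = int128_align if name == "Int128" else size
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets.append(offset)
        offset += size
        max_align = max(max_align, align)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total


# Same struct, laid out with 8-byte (older host) and 16-byte (device) alignment.
host = layout(["Int64", "Int128"], int128_align=8)     # ([0, 8], 24)
device = layout(["Int64", "Int128"], int128_align=16)  # ([0, 16], 32)
```

With an `Int64` followed by an `Int128`, the second field sits at offset 8 under 8-byte alignment but at offset 16 under 16-byte alignment, so a device reading the host's bytes would look in the wrong place.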
CUDA.jl Benchmarks
| Benchmark suite | Current: 356c85b | Previous: 112549e | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 100163 ns | 99517 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 75240 ns | 75910 ns | 0.99 |
| array/accumulate/Float32/dims=1L | 1628693 ns | 1594980 ns | 1.02 |
| array/accumulate/Float32/dims=2 | 140970 ns | 141259 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 652567 ns | 653724 ns | 1.00 |
| array/accumulate/Int64/1d | 118755 ns | 118852 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79140 ns | 79413 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1723746 ns | 1709492 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 153114 ns | 154250 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 960242 ns | 960390 ns | 1.00 |
| array/broadcast | 18384 ns | 18461 ns | 1.00 |
| array/construct | 1197.5 ns | 1193 ns | 1.00 |
| array/copy | 16621 ns | 16550 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 213583 ns | 214764 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 278812 ns | 280613 ns | 0.99 |
| array/copyto!/gpu_to_gpu | 10254 ns | 10344 ns | 0.99 |
| array/iteration/findall/bool | 133353 ns | 134100 ns | 0.99 |
| array/iteration/findall/int | 147614 ns | 147421 ns | 1.00 |
| array/iteration/findfirst/bool | 69959 ns | 112673 ns | 0.62 |
| array/iteration/findfirst/int | 71112 ns | 112820 ns | 0.63 |
| array/iteration/findmin/1d | 67998 ns | 67036 ns | 1.01 |
| array/iteration/findmin/2d | 101335 ns | 100960 ns | 1.00 |
| array/iteration/logical | 193754 ns | 193400 ns | 1.00 |
| array/iteration/scalar | 65567 ns | 64965 ns | 1.01 |
| array/permutedims/2d | 49616 ns | 49581 ns | 1.00 |
| array/permutedims/3d | 50731 ns | 50662 ns | 1.00 |
| array/permutedims/4d | 50885 ns | 50962 ns | 1.00 |
| array/random/rand/Float32 | 11550 ns | 12069 ns | 0.96 |
| array/random/rand/Int64 | 22788 ns | 24024 ns | 0.95 |
| array/random/rand!/Float32 | 7837.333333333333 ns | 8798.666666666666 ns | 0.89 |
| array/random/rand!/Int64 | 17838 ns | 20664 ns | 0.86 |
| array/random/randn/Float32 | 35484 ns | 35378 ns | 1.00 |
| array/random/randn!/Float32 | 23789 ns | 23654 ns | 1.01 |
| array/reductions/mapreduce/Float32/1d | 33624 ns | 33516 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 38432 ns | 38509 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 50250 ns | 50248 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56205 ns | 55822 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 67291 ns | 67519 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 39237 ns | 39436 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 41561 ns | 41192 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=1L | 86410 ns | 86477 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 58727 ns | 57772 ns | 1.02 |
| array/reductions/mapreduce/Int64/dims=2L | 82371 ns | 83119 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 33454 ns | 33724 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 38555 ns | 38486 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 50180 ns | 50211 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56128 ns | 55852 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 66986 ns | 69022 ns | 0.97 |
| array/reductions/reduce/Int64/1d | 39115 ns | 39412 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 41248 ns | 40972 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 86521 ns | 86447 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 58247 ns | 57742 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2L | 83524 ns | 82671 ns | 1.01 |
| array/reverse/1d | 16824 ns | 16903 ns | 1.00 |
| array/reverse/1dL | 67720 ns | 67929 ns | 1.00 |
| array/reverse/1dL_inplace | 65187 ns | 65328 ns | 1.00 |
| array/reverse/1d_inplace | 8317.666666666666 ns | 10020.333333333334 ns | 0.83 |
| array/reverse/2d | 20065 ns | 20099 ns | 1.00 |
| array/reverse/2dL | 71781 ns | 71890 ns | 1.00 |
| array/reverse/2dL_inplace | 64950 ns | 65089 ns | 1.00 |
| array/reverse/2d_inplace | 9543 ns | 9724 ns | 0.98 |
| array/sorting/1d | 2654742 ns | 2658878 ns | 1.00 |
| array/sorting/2d | 1033402 ns | 1040327 ns | 0.99 |
| array/sorting/by | 3180132 ns | 3193494 ns | 1.00 |
| cuda/synchronization/context/auto | 1158.7 ns | 1122.1 ns | 1.03 |
| cuda/synchronization/context/blocking | 931.6296296296297 ns | 908.9714285714285 ns | 1.02 |
| cuda/synchronization/context/nonblocking | 6052.6 ns | 6022.8 ns | 1.00 |
| cuda/synchronization/stream/auto | 989.4 ns | 993.9 ns | 1.00 |
| cuda/synchronization/stream/blocking | 837.8115942028985 ns | 827.3783783783783 ns | 1.01 |
| cuda/synchronization/stream/nonblocking | 5901.6 ns | 5915 ns | 1.00 |
| integration/byval/reference | 143146 ns | 142979 ns | 1.00 |
| integration/byval/slices=1 | 145285 ns | 145133 ns | 1.00 |
| integration/byval/slices=2 | 283812 ns | 283763 ns | 1.00 |
| integration/byval/slices=3 | 422279 ns | 422104 ns | 1.00 |
| integration/cudadevrt | 101563 ns | 101484 ns | 1.00 |
| integration/volumerhs | 8997466 ns | 9077118 ns | 0.99 |
| kernel/indexing | 12705 ns | 12734 ns | 1.00 |
| kernel/indexing_checked | 13427 ns | 13463 ns | 1.00 |
| kernel/launch | 2058 ns | 2146.3333333333335 ns | 0.96 |
| kernel/occupancy | 724.4328358208955 ns | 688.7569444444445 ns | 1.05 |
| kernel/rand | 15233 ns | 14254 ns | 1.07 |
| latency/import | 3850082067 ns | 3847133996 ns | 1.00 |
| latency/precompile | 4630800385 ns | 4625229019 ns | 1.00 |
| latency/ttfp | 4521745566 ns | 4496455873 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
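For reading the Ratio column: the rows are consistent with the ratio being the current time divided by the previous time, rounded to two decimals, so values below 1 mean the current commit is faster. That reading is an assumption inferred from the numbers, not taken from the tool's documentation, and `benchmark_ratio` is a hypothetical helper.

```python
def benchmark_ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time, rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


# Values taken from the findfirst/bool and kernel/rand rows of the table.
findfirst_bool = benchmark_ratio(69959, 112673)  # below 1: faster than before
kernel_rand = benchmark_ratio(15233, 14254)      # above 1: slower than before
```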
The external back-end lowers fast minnum/minimum to single min/max instructions instead of compare + select, picking the NaN-propagating variants where available.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Julia's floating-point min/max follow IEEE 754-2019 minimum/maximum semantics, which map directly onto these intrinsics. The external back-end legalizes them on every subtarget (min.NaN/max.NaN on sm_80+, a NaN/signed-zero-correct expansion elsewhere), so drop the manual emulation based on __nv_fmin plus a NaN fix-up. That emulation also inherited llvm.minnum's loose signed-zero semantics, causing constant folding to break the -0.0 < +0.0 ordering on device.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
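The difference between the two semantics can be modelled on the host. The sketch below is illustrative Python rather than what the back-end generates; `ieee_minimum` and `minnum_like` are hypothetical names, and `minnum_like` shows just one behaviour that a loosely specified minNum-style operation may exhibit for signed zeros.

```python
import math


def ieee_minimum(a, b):
    """IEEE 754-2019 minimum: NaN propagates, and -0.0 orders below +0.0."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if a == b:  # also covers +0.0 vs -0.0, which compare equal
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b


def minnum_like(a, b):
    """Looser minNum-style minimum: NaN is dropped, zero sign not guaranteed."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a < b else b  # for equal zeros this simply returns `b`
```

Here `ieee_minimum(-0.0, 0.0)` and `ieee_minimum(0.0, -0.0)` both return `-0.0`, while `minnum_like(-0.0, 0.0)` returns `+0.0`, which is the kind of ordering break described above.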
device_layout called sizeof on every zero-field DataType, but types like Symbol don't have a definite size. Non-isbits arguments are passed by reference, so their layout is Julia's business on both sides; treat them as opaque. Only affected Julia 1.10/1.11, where the layout check is active. Also add tests for the Int128 layout rejection itself.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that targets are selected based on the back-end LLVM, recent devices compile natively (e.g. sm_120a) rather than for an older baseline. Adjust the feature-set expectation to consult the back-end version, and accept the wider vector accesses (v2.b64) such targets prefer over v4.b32.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3169 +/- ##
==========================================
- Coverage 16.33% 16.32% -0.02%
==========================================
Files 124 124
Lines 9875 9875
==========================================
- Hits 1613 1612 -1
- Misses 8262 8263 +1
☔ View full report in Codecov by Harness.
maleadt added a commit to JohnCobbler/CUDA.jl that referenced this pull request on Jun 11, 2026
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Machine code generation goes through an external LLVM 22 `llc` now, with the in-process LLVM only driving the middle end. That makes a bunch of old workarounds unnecessary, and unlocks some functionality:

- `llvm_compat` reports the capabilities of `NVPTX_LLVM_Backend_jll` instead of the in-process LLVM, so PTX target selection isn't held back by the Julia-bundled version anymore. This adds sm_88 and sm_110 support.
- `llvm.`-prefixed declarations the in-process LLVM doesn't recognize no longer trigger device-runtime linking; the back-end lowers them.
- `min.NaN.f64`/`max.NaN.f64` instructions #2886, which is fixed in LLVM 21+.
- `atomicrmw fadd` instead of inline assembly, and BFloat16 gets native atomic add/sub on Julia 1.11+ (with a CAS fallback below sm_70 resp. sm_90).
- `active_mask()` calls `llvm.nvvm.activemask` on LLVM 20+. The inline-assembly fallback for older versions is marked side-effecting, as it could previously be hoisted or merged across divergent control flow.
- `exp2` uses the `ex2.approx` intrinsic.

One caveat: `llc` recomputes the data layout from the triple, ignoring the module's, so 128-bit integers are always 16-byte aligned on the device. Julia only aligns them that way since 1.12, meaning aggregates with (U)Int128 fields may lay out differently on older hosts. Kernel arguments with such layout mismatches are now rejected with an error pointing at Julia 1.12.

Also includes a test guarding against dynamically-indexed aggregate arguments being copied to local memory (the regression fixed by llvm/llvm-project#201772), and updates the fdiv/rcp PTX tests for the new back-end's lowering (`inv` now selects `rcp` instructions, and fast Float64 division gets Newton refinement).